Prosperative Public Forum

Google Index Miner


Google Index Miner
« on: November 17, 2011, 01:07:05 PM »
Don't know if this one's commercial enough - but I'd certainly buy it, so perhaps other webmasters with big sites would too:

Since Google Panda has slammed lots of big sites for having "low quality" pages on their domain, it's become important to look hard at the quality of every page on your domain that Google has in its index.

But if you have a big site (or in fact anything over 1,000 pages) - it's almost impossible to get a list of all the URLs that Google has in its index.

For example - we have a site where a site: query reports 135,000 pages indexed, but Google will only return 1,000 URLs. So you're left guessing as to what the rest are.

I reckon we've got around 100,000 real pages - so Google's indexed around 35,000 strange variations, pages with query strings, or stuff I've plain forgotten about.

A utility that could extract a full list (or even a big list) of all the pages Google has in its index for a particular domain would be incredibly useful - once you know what they've got, you could use noindex and robots.txt to clean it up.

I reckon this might be possible by querying Google with site: and words that only appear on a few pages on your site - and then de-duping the results, to create a list of unique pages in Google's index.

Might not be possible to be exhaustive - but I reckon it would be easy to get way beyond the 1,000 or so URLs that Google will return.
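
Something like this rough sketch is what I have in mind - Python just to illustrate the idea, and fetch_serp_urls() is a placeholder for however you actually pull the result URLs for one query, not a real Google API:

# Rough sketch only - fetch_serp_urls() is a placeholder, not a real Google API.
# Swap in whatever you use to get the result URLs for a single query
# (capped at roughly 1,000 per query).

def fetch_serp_urls(query):
    """Placeholder: return the list of result URLs Google shows for one query."""
    return []  # replace with a real lookup - returns nothing as written

def mine_index(domain, keywords):
    """Run site: queries narrowed by different keywords and merge the results."""
    found = set()
    for word in keywords:
        query = 'site:%s "%s"' % (domain, word)
        for url in fetch_serp_urls(query):
            found.add(url)  # the set handles the de-duping
    return sorted(found)

if __name__ == "__main__":
    urls = mine_index("www.mysite.com", ["hamsters", "new york", "weight loss"])
    print("%d unique URLs found so far" % len(urls))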

Re: Google Index Miner
« Reply #1 on: November 17, 2011, 02:32:53 PM »
Quote from: 7Driver on November 17, 2011, 01:07:05 PM

7Driver, 8)

I'll 2nd this one!

This sounds like a real good idea! And I would also be very interested in a product that applies that much knowledge.

But, what's up with the 'I reckon'? ???  'I know that you know', you ain't no Georgia or East Texas boy!!! ;)

HoneyJo

'I haven't lost my mind, it's backed up on my hard-drive somewhere!'
American Freelance Writer

Re: Google Index Miner
« Reply #2 on: November 18, 2011, 02:00:25 AM »
 ;D

Re: Google Index Miner
« Reply #3 on: November 18, 2011, 02:10:08 AM »
Hi

I can see a huge problem with this.

Google only returns 1,000 results max, so if you're mining Google, how do you get past the 1,000-result mark?

The tool couldn't ask for result 1,001 because it's not there.

Just my thoughts
Eliot

Re: Google Index Miner
« Reply #4 on: November 18, 2011, 09:16:55 AM »
Quote from: Eliot on November 18, 2011, 02:10:08 AM
;D 

Eliot, you are probably right, but hey, the big G is full of problems.

Beginning with: as soon as we learn something, they change the rules!!
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

And 7Driver, I love it!! ::)

HoneyJo
'I haven't lost my mind, it's backed up on my hard-drive somewhere!'
American Freelance Writer

Re: Google Index Miner
« Reply #5 on: November 23, 2011, 01:35:29 AM »
Oh, it's there alright - Google returns 1,000 results PER QUERY - you "just" need to do lots of queries and aggregate the results.

If you just do

site:www.mysite.com

Then you'll get a maximum of 1,000 results.  But you could do:

site:www.mysite.com "hamsters"
site:www.mysite.com "new york"
site:www.mysite.com "weight loss"

and it'll return up to 1,000 results for each query. They may overlap quite a bit, so you'd need to aggregate and de-dupe the results - but you'd end up with far more than 1,000 unique URLs. You can also do things like:

site:www.mysite.com inurl:/subdirectory/

to find all the pages Google has in its index from a particular subdirectory, or:

site:www.mysite.com intitle:"in washington"

You can do all this stuff manually - but for a big site the data extraction and de-duplication would take a long time to do by hand. An automated Google mining tool could have presets to do other interesting things - not just extracting the whole site, but finding https pages, finding pages with query strings, etc.
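
Once you've got the merged list, spotting those oddities is just local filtering - no more queries needed. A rough sketch, again in Python, assuming the mined URLs have already been saved one per line (indexed_urls.txt is just an example name):

# Rough sketch: sort an already-collected list of indexed URLs into buckets.
# Assumes the mined URLs sit in indexed_urls.txt, one per line (example name).
from collections import defaultdict
from urllib.parse import urlparse

with open("indexed_urls.txt") as f:
    urls = sorted({line.strip() for line in f if line.strip()})  # de-dupe again, just in case

https_pages = [u for u in urls if u.startswith("https://")]
querystring_pages = [u for u in urls if urlparse(u).query]

by_subdir = defaultdict(list)
for u in urls:
    parts = urlparse(u).path.strip("/").split("/")
    by_subdir[parts[0] or "(root)"].append(u)

print("total unique URLs:        %d" % len(urls))
print("https pages:              %d" % len(https_pages))
print("pages with query strings: %d" % len(querystring_pages))
for subdir, pages in sorted(by_subdir.items(), key=lambda kv: -len(kv[1])):
    print("  %-25s %d pages" % (subdir, len(pages)))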

Lots of possibilities for getting a better understanding of your site in Google's index.

Re: Google Index Miner
« Reply #6 on: December 11, 2011, 02:59:22 AM »
I don't see a big market for this kind of tool since not many of us have these huge sites, but it's straightforward to code in, say, C#.

Possibly you could use an existing data-mining tool such as OutWit Hub to do this?